NSF PAR Search | NSF Public Access Repository

Galley: Modern Query Optimization for Sparse Tensor Programs

https://doi.org/10.1145/3725301

Deeds, Kyle; Ahrens, Willow; Balazinska, Magdalena; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

The tensor programming abstraction is a foundational paradigm which allows users to write high performance programs via a high-level imperative interface. Recent work on sparse tensor compilers has extended this paradigm to sparse tensors (i.e., tensors where most entries are not explicitly represented). With these systems, users define the semantics of the program and the algorithmic decisions in a concise language that can be compiled to efficient low-level code. However, these systems still require users to make complex decisions about program structure and memory layouts to write efficient programs. This work presents Galley, a system for declarative tensor programming that allows users to write efficient tensor programs without making complex algorithmic decisions. Galley is the first system to perform cost based lowering of sparse tensor algebra to the imperative language of sparse tensor compilers, and the first to optimize arbitrary operators beyond Σ and *. First, it decomposes the input program into a sequence of aggregation steps through a novel extension of the FAQ framework. Second, Galley optimizes and converts each aggregation step to a concrete program, which is compiled and executed with a sparse tensor compiler. We show that Galley produces programs that are 1-300x faster than competing methods for machine learning over joins and 5-20x faster than a state-of-the-art relational database for subgraph counting workloads with a minimal optimization overhead.
Free, publicly-accessible full text available June 17, 2026
Self-Enhancing Video Data Management System for Compositional Events with Large Language Models

https://doi.org/10.1145/3725352

Zhang, Enhao; Sullivan, Nicole; Haynes, Brandon; Krishna, Ranjay; Balazinska, Magdalena (June 2025, Proceedings of the ACM on Management of Data)

Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
Free, publicly-accessible full text available June 17, 2026
VOCALExplore: Pay-as-You-Go Video Data Exploration and Model Building

https://doi.org/10.14778/3625054.3625057

Daum, Maureen; Zhang, Enhao; He, Dong; Mussmann, Stephen; Haynes, Brandon; Krishna, Ranjay; Balazinska, Magdalena (September 2023, Proceedings of the VLDB Endowment)

We introduce VOCALExplore, a system designed to support users in building domain-specific models over video datasets. VOCALExplore supports interactive labeling sessions and trains models using user-supplied labels. VOCALExplore maximizes model quality by automatically deciding how to select samples based on observed skew in the collected labels. It also selects the optimal video representations to use when training models by casting feature selection as a rising bandit problem. Finally, VOCALExplore implements optimizations to achieve low latency without sacrificing model performance. We demonstrate that VOCALExplore achieves close to the best possible model quality given candidate acquisition functions and feature extractors, and it does so with low visible latency (~1 second per iteration) and no expensive preprocessing.
Full Text Available
EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions

https://doi.org/10.14778/3611479.3611482

Zhang, Enhao; Daum, Maureen; He, Dong; Haynes, Brandon; Krishna, Ranjay; Balazinska, Magdalena (July 2023, Proceedings of the VLDB Endowment)

We introduce EQUI-VOCAL: a new system that automatically synthesizes queries over videos from limited user interactions. The user only provides a handful of positive and negative examples of what they are looking for. EQUI-VOCAL utilizes these initial examples and additional ones collected through active learning to efficiently synthesize complex user queries. Our approach enables users to find events without database expertise, with limited labeling effort, and without declarative specifications or sketches. Core to EQUI-VOCAL's design is the use of spatio-temporal scene graphs in its data model and query language and a novel query synthesis approach that works on large and noisy video data. Our system outperforms two baseline systems (in terms of F1 score, synthesis time, and robustness to noise) and can flexibly synthesize complex queries that the baselines do not support.
Full Text Available
EQUI-VOCAL Demonstration: Synthesizing Video Queries from User Interactions

Zhang, Enhao; Daum, Maureen; He, Dong; Ganti, Manasi; Haynes, Brandon; Krishna, Ranjay; Balazinska, Magdalena (August 2023, Proceedings of the VLDB Endowment)

We demonstrate EQUI-VOCAL, a system that synthesizes compositional queries over videos from user feedback. EQUI-VOCAL enables users to query a video database for complex events by providing a few positive and negative examples of what they are looking for and labeling a small number of additional system-selected examples. Using those user inputs, EQUI-VOCAL synthesizes declarative queries that can then retrieve additional instances of the desired events. The demonstration makes two contributions: it introduces EQUI-VOCAL’s graphical user interface and enables conference attendees to experiment with EQUI-VOCAL on a variety of queries. Both enable users to gain a better understanding of EQUI-VOCAL’s query synthesis approach and to explore the impact of hyperparameters and label noise on system performance.
Full Text Available
SafeBound: A Practical System for Generating Cardinality Bounds

https://doi.org/10.1145/3588907

Deeds, Kyle B.; Suciu, Dan; Balazinska, Magdalena (May 2023, Proceedings of the ACM on Management of Data)

Recent work has reemphasized the importance of cardinality estimates for query optimization. While new techniques have continuously improved in accuracy over time, they still generally allow for under-estimates which often lead optimizers to make overly optimistic decisions. This can be very costly for expensive queries. An alternative approach to estimation is cardinality bounding, also called pessimistic cardinality estimation, where the cardinality estimator provides guaranteed upper bounds of the true cardinality. By never underestimating, this approach allows the optimizer to avoid potentially inefficient plans. However, existing pessimistic cardinality estimators are not yet practical: they use very limited statistics on the data, and cannot handle predicates. In this paper, we introduce SafeBound, the first practical system for generating cardinality bounds. SafeBound builds on a recent theoretical work that uses degree sequences on join attributes to compute cardinality bounds, extends this framework with predicates, introduces a practical compression method for the degree sequences, and implements an efficient inference algorithm. Across four workloads, SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL, and is on par or better than state of the art ML-based estimators and pessimistic cardinality estimators, by improving the runtime of the expensive queries. It also saves up to 500x in query planning time, and uses up to 6.8x less space compared to state of the art cardinality estimation methods.
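To make the idea of a guaranteed upper bound concrete, here is a minimal sketch of a degree-sequence bound for a single two-table equi-join. It is an illustration only, not SafeBound's algorithm: it ignores predicates, compression, and multi-way joins, and the function names are hypothetical. It assumes each table is represented only by the list of its join-column values. Pairing the sorted degree sequences position by position can never underestimate the true join size (by the rearrangement inequality).

```python
from collections import Counter


def degree_sequence(keys):
    """Frequencies of the join-column values, sorted in descending order."""
    return sorted(Counter(keys).values(), reverse=True)


def join_size_upper_bound(r_keys, s_keys):
    """Upper bound on |R JOIN S| computed from degree sequences alone.

    The i-th most frequent value of R is paired with the i-th most
    frequent value of S, which is the worst case for the join size.
    """
    ds_r = degree_sequence(r_keys)
    ds_s = degree_sequence(s_keys)
    return sum(a * b for a, b in zip(ds_r, ds_s))


def true_join_size(r_keys, s_keys):
    """Exact equi-join size, used here only to check the bound."""
    count_r, count_s = Counter(r_keys), Counter(s_keys)
    return sum(count_r[k] * count_s[k] for k in count_r)


# Toy join columns (hypothetical data).
r_keys = [1, 1, 1, 2, 3]
s_keys = [2, 2, 3, 1]
bound = join_size_upper_bound(r_keys, s_keys)
exact = true_join_size(r_keys, s_keys)
```

On this toy input the bound is loose but safe: the optimizer would plan for a larger result than it actually gets, and never for a smaller one.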
DeepEverest: accelerating declarative top-K queries for deep neural network interpretation

https://doi.org/10.14778/3485450.3485460

He, Dong; Daum, Maureen; Cai, Walter; Balazinska, Magdalena (September 2021, Proceedings of the VLDB Endowment)

We design, implement, and evaluate DeepEverest, a system for the efficient execution of interpretation by example queries over the activation values of a deep neural network. DeepEverest consists of an efficient indexing technique and a query execution algorithm with various optimizations. We prove that the proposed query execution algorithm is instance optimal. Experiments with our prototype show that DeepEverest, using less than 20% of the storage of full materialization, significantly accelerates individual queries by up to 63X and consistently outperforms other methods on multi-query workloads that simulate DNN interpretation processes.
Full Text Available
VOCAL: Video Organization and Interactive Compositional AnaLytics

Daum, Maureen; Zhang, Enhao; He, Dong; Balazinska, Magdalena; Haynes, Brandon; Krishna, Ranjay; Craig, Apryle; Wirsing, Aaron (January 2022, 12th Annual Conference on Innovative Data Systems Research (CIDR ’22))

Current video database management systems (VDBMSs) fail to support the growing number of video datasets in diverse domains because these systems assume clean data and rely on pretrained models to detect known objects or actions. Existing systems also lack good support for compositional queries that seek events consisting of multiple objects with complex spatial and temporal relationships. In this paper, we propose VOCAL, a vision of a VDBMS that supports efficient data cleaning, exploration and organization, and compositional queries, even when no pretrained model exists to extract semantic content. These techniques utilize optimizations to minimize the manual effort required of users.
Full Text Available
TASM: A Tile-Based Storage Manager for Video Analytics

https://doi.org/10.1109/ICDE51399.2021.00156

Daum, Maureen; Haynes, Brandon; He, Dong; Mazumdar, Amrita; Balazinska, Magdalena (April 2021, 2021 IEEE 37th International Conference on Data Engineering (ICDE))
Modern video data management systems store videos as a single encoded file, which significantly limits possible storage level optimizations. We design, implement, and evaluate TASM, a new tile-based storage manager for video data. TASM uses a feature in modern video codecs called "tiles" that enables spatial random access into encoded videos. TASM physically tunes stored videos by optimizing their tile layouts given the video content and a query workload. Additionally, TASM dynamically tunes that layout in response to changes in the query workload or if the query workload and video contents are incrementally discovered. Finally, TASM also produces efficient initial tile layouts for newly ingested videos. We demonstrate that TASM can speed up subframe selection queries by an average of over 50% and up to 94%. TASM can also improve the throughput of the full scan phase of object detection queries by up to 2×.
Full Text Available
Sample Debiasing in the Themis Open World Database System

https://doi.org/10.1145/3318464.3380606

Orr, Laurel; Balazinska, Magdalena; Suciu, Dan (May 2020, SIGMOD)

Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage apriori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that Themis achieves higher query accuracy than the default AQP approach, an alternative sample reweighting technique, and a variety of Bayesian network models while maintaining interactive query response times. We also show that Themis is robust to differences in the support between the sample and population, a key use case when using social media samples.
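As a rough illustration of what sample reweighting means in general, the sketch below applies basic post-stratification: each sampled tuple gets a weight equal to its group's known population count divided by that group's count in the sample. This is a textbook technique shown under simplifying assumptions, not the reweighting method used in Themis; the function names and the toy data are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_counts):
    """Weight per group = known population count / count in the sample."""
    sample_counts = Counter(sample_groups)
    return {g: population_counts[g] / n for g, n in sample_counts.items()}


def weighted_count(sample_rows, weights, predicate):
    """Estimate a population COUNT(*) for rows that satisfy the predicate.

    sample_rows is a list of (group, value) pairs; each matching row
    contributes its group's weight instead of 1.
    """
    return sum(weights[g] for g, value in sample_rows if predicate(value))


# Biased toy sample: group "a" is over-represented relative to "b".
sample_rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
population_counts = {"a": 30, "b": 40}  # assumed known aggregates

weights = poststratification_weights(
    [g for g, _ in sample_rows], population_counts
)
estimated_total = weighted_count(sample_rows, weights, lambda v: True)
estimated_large = weighted_count(sample_rows, weights, lambda v: v >= 3)
```

With these weights the estimated total matches the known population size, which a naive scale-up of the biased sample would only achieve by accident.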
Full Text Available
